
    Frequency Domain Methods for Coding the Linear Predictive Residual of Speech Signals

    The most frequently used speech coding paradigm is ACELP, popular because it encodes speech at high quality while consuming little bandwidth. ACELP applies linear predictive filtering to remove the effect of the spectral envelope from the signal. The noise-like excitation is then encoded using algebraic codebooks. This codebook search, however, cannot be performed optimally by conventional encoders because the codebook samples are correlated, so more complex algorithms are required to maintain quality. Four transformation algorithms (DCT, DFT, eigenvalue decomposition and Vandermonde decomposition) were implemented to decorrelate the samples of the innovative excitation in ACELP, and were integrated into the ACELP of the EVS codec. The transformed innovative excitation is coded using the envelope-based arithmetic coder. Objective and subjective tests were carried out to evaluate the quality of the encoding, the degree of decorrelation achieved by the transformations, and the computational complexity of the algorithms.
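
    The core idea can be illustrated in a few lines: an orthonormal transform such as the DCT-II compacts the correlation between excitation samples, so the coefficients can be coded more independently. A minimal sketch, assuming a synthetic AR(1) correlation model and an illustrative frame length of 64 (neither taken from EVS):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
N, frames, rho = 64, 2000, 0.7  # illustrative frame length and correlation

# Synthesize correlated "excitation" frames with an AR(1) covariance model.
cov = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
x = rng.multivariate_normal(np.zeros(N), cov, size=frames)

# Orthonormal DCT-II per frame (norm="ortho" preserves energy).
y = dct(x, norm="ortho", axis=1)

def off_diag_fraction(z):
    """Fraction of covariance mass lying off the diagonal."""
    c = np.abs(np.cov(z, rowvar=False))
    return (c.sum() - np.trace(c)) / c.sum()

print("time domain:", off_diag_fraction(x))   # large: samples correlated
print("DCT domain: ", off_diag_fraction(y))   # small: mostly decorrelated
```

    For an AR(1) source the DCT approaches the optimal Karhunen-Loève (eigenvalue) transform, which is one reason both appear among the four candidates compared here.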

    The Use of Audio Fingerprints for Authentication of Speakers on Speech Operated Interfaces

    In a multi-speaker and multi-device environment, acoustic fingerprint information is needed for authentication between devices. Since different speakers may join or leave such an environment at any time, it is crucial to continuously verify the authenticity of speakers and devices within a short time span. In this work, we propose providing different levels of authentication to different speakers in a multi-speaker, multi-device environment using acoustic audio fingerprint information. First, audio fingerprints are extracted continuously every few seconds. The extracted fingerprints are then passed to a speaker recognition module, which checks whether the fingerprint is enrolled for that particular environment. Finally, the appropriate level of authentication is provided for each speaker. Our experimental results on the VoxCeleb-1 dataset show that acoustic fingerprints can be successfully used for authentication purposes in a multi-speaker, multi-device environment. Peer reviewed.
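
    A minimal sketch of this kind of continuous fingerprint extraction, using the sign of band-energy changes over time as bits; the band layout, window length, and synthetic test signal are illustrative assumptions rather than the paper's exact method:

```python
import numpy as np
from scipy.signal import stft

def fingerprint(audio, fs=16000, bands=16):
    """Binary fingerprint: signs of band-energy changes over time."""
    _, _, Z = stft(audio, fs=fs, nperseg=512)
    power = np.abs(Z) ** 2
    # Collapse the frequency axis into coarse bands.
    edges = np.linspace(0, power.shape[0], bands + 1, dtype=int)
    band_e = np.array([power[lo:hi].sum(axis=0)
                       for lo, hi in zip(edges[:-1], edges[1:])])
    # One bit per band and frame: does the band energy rise or fall?
    return (np.diff(band_e, axis=1) > 0).astype(np.uint8).ravel()

def bit_agreement(a, b):
    """Fraction of matching fingerprint bits (0.5 = unrelated audio)."""
    n = min(len(a), len(b))
    return 1.0 - np.count_nonzero(a[:n] ^ b[:n]) / n

rng = np.random.default_rng(1)
scene = rng.standard_normal(5 * 16000)                  # shared acoustic scene
dev_a = scene + 0.1 * rng.standard_normal(scene.shape)  # device A's recording
dev_b = scene + 0.1 * rng.standard_normal(scene.shape)  # device B's recording
print(bit_agreement(fingerprint(dev_a), fingerprint(dev_b)))  # close to 1.0
```

    Devices that hear the same acoustic scene produce fingerprints with high bit agreement, while devices elsewhere fall toward 50% agreement, which is what makes the bits usable as evidence of co-presence.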

    Acoustic Fingerprints for Access Management in Ad-Hoc Sensor Networks

    Voice user interfaces can offer intuitive interaction with our devices, but usability and audio quality could be further improved if multiple devices could collaborate to provide a distributed voice user interface. To ensure that users’ voices are not shared with unauthorized devices, it is however necessary to design an access management system that adapts to the users’ needs. Prior work has demonstrated that a combination of audio fingerprinting and fuzzy cryptography yields a robust pairing of devices without sharing the information that they record. However, the robustness of these systems rests partly on the extensive duration of the recordings required to obtain the fingerprint. This paper analyzes methods for the robust generation of acoustic fingerprints over short periods of time, enabling the responsive pairing of devices according to changes in the acoustic scenery; the methods can be integrated into other typical speech processing tools. Peer reviewed.
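
    The pairing idea can be sketched briefly. A plain Hamming-distance threshold below stands in for the fuzzy cryptography (in the actual scheme an error-correcting code bridges the bit disagreements before a key is derived), and the 10% tolerance is an illustrative assumption:

```python
import hashlib
import numpy as np

def can_pair(fp_a, fp_b, tolerance=0.10):
    """Pairing is feasible when the fingerprints disagree in few enough
    bits for the error-correcting code to bridge (threshold illustrative)."""
    disagreement = np.count_nonzero(fp_a ^ fp_b) / len(fp_a)
    return disagreement <= tolerance

def derive_key(fp_bits):
    """Hash the (error-corrected) fingerprint bits into a shared key."""
    return hashlib.sha256(np.packbits(fp_bits).tobytes()).hexdigest()
```

    Shorter recordings yield fewer, noisier bits, so the challenge addressed here is keeping the disagreement rate low even when the fingerprint window shrinks to a few seconds.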

    Provable Consent for Voice User Interfaces

    The proliferation of acoustic human-computer interaction raises privacy concerns, since it allows Voice User Interfaces (VUIs) to overhear human speech and to analyze and share the content of overheard conversations in cloud datacenters and with third parties. This process is non-transparent regarding when and which audio is recorded, the reach of the speech recording, the information extracted from a recording, and the purpose for which it is used. To return control over the use of audio content to the individual who generated it, we promote intuitive privacy for VUIs, featuring a lightweight consent mechanism as well as means of secure verification (proof of consent) for any recorded piece of audio. In particular, through audio fingerprinting and fuzzy cryptography, we establish a trust zone whose area is implicitly controlled by voice loudness relative to environmental noise, i.e. the signal-to-noise ratio (SNR). Secure keys are exchanged to verify consent on the use of an audio sequence via digital signatures. We performed experiments with different levels of human voice, corresponding to various trust situations (e.g. whispering and group discussion). A second scenario was investigated, in which a VUI outside of the trust zone could not obtain the shared secret key. Peer reviewed.
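
    How loudness implicitly sets the trust-zone boundary is easy to sketch: a device only participates in the consent exchange when the voice it captures stands sufficiently above the ambient noise. The energy-based SNR estimate and the 10 dB threshold below are illustrative assumptions, not the paper's calibrated values:

```python
import numpy as np

def snr_db(speech, noise):
    """Energy-based SNR estimate from speech and ambient-noise segments."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

def inside_trust_zone(speech, noise, threshold_db=10.0):
    # A whisper clears the threshold only very near the speaker, so the
    # trust zone shrinks; speaking up widens it to more distant devices.
    return snr_db(speech, noise) >= threshold_db
```

    Devices outside the zone never clear the threshold and therefore never obtain the shared secret needed to sign or verify consent.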

    Cancellation of Local Competing Speaker with Near-field Localization for Distributed Ad-Hoc Sensor Network

    In scenarios such as remote work, open offices and call centers, multiple people may simultaneously have independent spoken interactions with their devices in the same room. The speech of competing speakers will, however, be picked up by all microphones, both reducing audio quality and exposing speakers to breaches of privacy. We propose a cooperative cross-talk cancellation solution that breaks the single-active-speaker assumption employed by most telecommunication systems. The proposed method applies source separation to the microphone signals of independent devices to extract the dominant speaker at each device. It is realized using a localization estimator based on a deep neural network, followed by a time-frequency mask that separates the target speech from the interfering speech at each time-frequency unit according to its estimated orientation. Experimental evaluation confirms that the proposed method effectively reduces cross-talk and exceeds the baseline expectation-maximization method by 10 dB in terms of interference rejection. This performance makes the proposed method a viable solution for cross-talk cancellation in near-field conditions, thus protecting the privacy of other speakers in the same acoustic space. Peer reviewed.
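
    The masking step can be sketched for a two-microphone node: each short-time Fourier transform (STFT) bin is kept only when its inter-channel phase matches the target direction. In the paper the per-bin orientation comes from a DNN localizer; the phase-based test, microphone spacing, and tolerance below are simplifying assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_by_direction(x1, x2, fs=16000, d=0.05, c=343.0,
                      target_doa_deg=0.0, tol_rad=0.35):
    """Keep STFT bins whose phase fits a plane wave from the target DoA."""
    f, _, X1 = stft(x1, fs=fs, nperseg=512)
    _, _, X2 = stft(x2, fs=fs, nperseg=512)
    # Expected inter-channel phase for the target direction of arrival.
    tdoa = d * np.sin(np.deg2rad(target_doa_deg)) / c
    expected = np.exp(-2j * np.pi * f[:, None] * tdoa)
    # Binary mask: observed cross-spectrum phase close to expected phase.
    err = np.angle(X2 * np.conj(X1) * np.conj(expected))
    mask = np.abs(err) < tol_rad
    _, y = istft(X1 * mask, fs=fs, nperseg=512)
    return y
```

    Bins dominated by a competing speaker from another direction violate the phase model and are suppressed, which is the same per-bin orientation logic the DNN-driven mask applies.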

    Perception of Privacy Measured in the Crowd – Paired Comparison on the Effect of Background Noises

    Voice-based devices and virtual assistants are widely integrated into our daily lives, but their growing popularity has also raised concerns about data privacy in processing and storage. While improvements in technology and data protection regulations have been made to provide users a more secure experience, the concept of privacy continues to face enormous challenges. We can observe that people intuitively adjust their way of talking in a human-to-human conversation, an intuition that devices could benefit from to increase their level of privacy. To enable devices to quantify privacy in an acoustic scenario, this paper focuses on how people perceive privacy with respect to environmental noise. We measured privacy scores on a crowdsourcing platform with a paired-comparison listening test and obtained reliable and consistent results. Our measurements show that the experience of privacy varies depending on the acoustic features of the ambient noise. Furthermore, multiple probabilistic choice models were fitted to the data to obtain a meaningful ordering of noise scenarios conveying listeners' preferences. A preference tree model was found to fit best, indicating that subjects change their decision strategy depending on the scenarios under test. Peer reviewed.
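
    The simplest of the probabilistic choice models applicable to such data is the Bradley-Terry model, which turns pairwise win counts into a one-dimensional preference scale. A minimal sketch with made-up win counts (the best-fitting preference tree model is more elaborate and not reproduced here):

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry scores; wins[i, j] = times i beat j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):  # Hunter's MM (iterative scaling) update
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den
        p /= p.sum()
    return p  # higher score = scenario judged more private, say

wins = np.array([[0, 7, 9],    # illustrative counts for three
                 [3, 0, 6],    # noise scenarios, not measured data
                 [1, 4, 0]])
print(bradley_terry(wins))
```

    A single scale like this assumes one fixed decision strategy; the preference tree model relaxes exactly that assumption, matching the finding that subjects change strategy across scenarios.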

    Speech Localization at Low Bitrates in Wireless Acoustic Sensor Networks

    The use of speech source localization (SSL) and its applications offer great possibilities for the design of speaker local positioning systems with wireless acoustic sensor networks (WASNs). Recent works have shown that data-driven front-ends can outperform traditional algorithms for SSL when trained for specific domains, depending on factors such as reverberation and noise levels. However, such localization models operate directly on raw sensor observations, without consideration for transmission losses in WASNs. In practice, when sensors reside in separate real-life devices, the sensor data must be quantized, encoded and transmitted, which degrades localization performance, especially at low transmission bitrates. In this work, we investigate the effect of low-bitrate transmission on a direction-of-arrival (DoA) estimator. We analyze the performance of a deep neural network (DNN) based framework as a function of the audio encoding bitrate, employing recent communication codecs including PyAWNeS, Opus, EVS, and Lyra. Experimental results show that training the DNN on input encoded with the PyAWNeS codec at 16.4 kb/s improves the accuracy significantly, and that up to 50% of the accuracy degradation at low bitrates can be recovered for almost all codecs. Our results further show that when one of the two channels can be encoded at more than 32 kb/s, the best accuracy is obtained by keeping the second channel raw; at lower bitrates, it is preferable to encode both channels alike. More importantly, for practical applications, a more generalized model trained with a randomly selected codec for each channel shows a large accuracy gain when at least one of the two channels is encoded with PyAWNeS. Peer reviewed.
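
    The evaluation protocol implied above can be sketched as a loop over coding configurations. Here `encode_decode` is a crude stand-in (coarser quantization at lower rates) for the real PyAWNeS/Opus/EVS/Lyra bindings, and `doa_model` is a placeholder for the trained DNN; both names are assumptions for illustration:

```python
import numpy as np

def encode_decode(channel, bitrate_kbps):
    """Crude stand-in for a real codec: coarser quantization at lower
    bitrates. A real experiment would call the codec bindings instead."""
    step = 2.0 ** (-bitrate_kbps / 8.0)   # arbitrary rate-to-step mapping
    return np.round(channel / step) * step

def doa_accuracy(doa_model, pairs, true_doas, bitrate_kbps,
                 encode_second=True):
    """Fraction of examples whose DoA class survives the coding chain."""
    correct = 0
    for (ch1, ch2), true_doa in zip(pairs, true_doas):
        d1 = encode_decode(ch1, bitrate_kbps)
        # Above ~32 kb/s the paper found it better to keep the second
        # channel raw; below that, to encode both channels alike.
        d2 = encode_decode(ch2, bitrate_kbps) if encode_second else ch2
        correct += int(doa_model(d1, d2) == true_doa)
    return correct / len(pairs)
```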

    Introduction to Speech Processing

    This is a snapshot of the wiki-format book, taken on 31.12.2020. Non peer reviewed.

    PyAWNeS-Codec: Speech and audio codec for ad-hoc acoustic wireless sensor networks

    Existing hardware with microphones can potentially be used as sensor networks to capture speech and audio signals, yielding better signal quality than is possible with a single microphone. A central prerequisite for such ad-hoc acoustic wireless sensor networks (AWSNs) is an efficient communication protocol with which to transmit audio data between nodes. For that purpose, we present the world's first speech and audio codec designed especially for AWSNs, which offers competitive quality also in single-channel operation. To ensure quality in the single-channel scenario, it closely resembles conventional codecs of the TCX type, extended with features that facilitate multi-device operation, including dithered quantization, delay estimation and compensation, as well as multi-channel post-filtering. The codec is intended to become a baseline for future research, and we therefore provide it as an open-access library. Our experiments confirm that its performance is in the same range as recent commercial single-channel codecs and that added devices improve quality. Peer reviewed.
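
    Of the multi-device features listed above, dithered quantization is the simplest to sketch: with subtractive dither regenerated from a shared seed, each node's quantization error is independent of the signal, so errors across nodes are uncorrelated and average out when signals are combined. Step size and seeding below are illustrative choices, not the codec's actual parameters:

```python
import numpy as np

def dithered_quantize(x, step, seed):
    """Subtractive dither: encoder adds dither known to both ends."""
    dither = np.random.default_rng(seed).uniform(-step / 2, step / 2, x.shape)
    return np.round((x + dither) / step)            # transmitted indices

def dithered_dequantize(indices, step, seed):
    """Decoder regenerates the same dither from the seed and removes it."""
    dither = np.random.default_rng(seed).uniform(-step / 2, step / 2,
                                                 indices.shape)
    return indices * step - dither
```

    With a different seed per node, combining signals from added devices averages the quantization noise down rather than reinforcing it, consistent with the observation that added devices improve quality.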

    Sound Privacy: A Conversational Speech Corpus for Quantifying the Experience of Privacy

    With the growing popularity of social networks, cloud services and online applications, people are becoming concerned about the way companies store their data and the ways in which that data can be used. Privacy with voice-operated devices and services is of particular interest. To enable studies in privacy, this paper presents a database which quantifies the experience of privacy users have in spoken communication. We focus on the effect of the acoustic environment on that perception of privacy. Speech signals were recorded in scenarios simulating real-life situations where the acoustic environment affects the experience of privacy. The acoustic data is complemented with measures of the speakers’ experience of privacy, collected using a questionnaire. The presented corpus enables studies of how acoustic environments affect people’s experience of privacy, which, in turn, can be used to develop speech-operated applications that respect their users’ right to privacy. Peer reviewed.